Abstract

We investigate the problem of learning a topic model, the well-known Latent
Dirichlet Allocation, in a distributed manner, using a cluster of C processors
and dividing the corpus to be learned equally among them. We propose a simple
approximate method that can be tuned, trading speed for accuracy according to
the task at hand. Our approach is asynchronous, and therefore suitable for clusters
of heterogeneous machines.

1 Introduction 

Very large datasets are becoming increasingly common - from specific collections, 
such as Reuters and PubMed, to very broad and large ones, such as the images and 
metadata of sites like Flickr, scanned books of sites like Google Books and the whole 
internet content itself. Topic models, such as Latent Dirichlet Allocation (LDA), have
proved to be a useful tool to model such collections, but suffer from scalability
limitations. Even though there have been some recent advances in speeding up inference
for such models, this remains a fundamental open problem.

2 Latent Dirichlet Allocation 

Before introducing our method we briefly describe the Latent Dirichlet Allocation
(LDA) topic model [BNJ03]. In LDA (see Figure 1), each document is modeled as a
mixture over K topics, and each topic k has a multinomial distribution $\beta_k$ over a
vocabulary of V words (see Table 1 for a summary of the notation used throughout
this paper). For a given document m we first draw a topic distribution $\theta_m$ from a
Dirichlet distribution parametrized by $\alpha$. Then, for each word n in the document we
draw a topic $z_{m,n}$ from a multinomial distribution with parameter $\theta_m$. Finally, we draw
the word n from the multinomial distribution parametrized by $\beta_{z_{m,n}}$.

Figure 1: LDA model.
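The generative process just described can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `generate_corpus`, the fixed document length and the toy parameter values are assumptions, and each $\beta_k$ is drawn from a Dirichlet with parameter $\eta$, as listed in Table 1.

```python
import numpy as np

def generate_corpus(M, K, V, doc_len, alpha, eta, seed=0):
    """Draw a toy corpus from the LDA generative process."""
    rng = np.random.default_rng(seed)
    # One word distribution beta_k per topic, drawn from Dirichlet(eta).
    beta = rng.dirichlet(np.full(V, eta), size=K)
    docs, topics = [], []
    for m in range(M):
        # Topic mixture theta_m for document m, drawn from Dirichlet(alpha).
        theta_m = rng.dirichlet(np.full(K, alpha))
        # Topic z_{m,n} for every word position n.
        z = rng.choice(K, size=doc_len, p=theta_m)
        # Word n is drawn from the word distribution of its topic.
        w = np.array([rng.choice(V, p=beta[k]) for k in z])
        docs.append(w)
        topics.append(z)
    return docs, topics, beta

# Toy values chosen only for illustration.
docs, topics, beta = generate_corpus(M=5, K=3, V=20, doc_len=10, alpha=0.5, eta=0.5)
```

Inference runs this process in reverse: only the words are observed, and the topics must be recovered.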



2.1 Inference in LDA 

Many inference algorithms for LDA have been proposed, such as variational Bayesian
(VB) inference [BNJ03], expectation propagation (EP) [ ], collapsed Gibbs
sampling [GS04, Hei04] and collapsed variational Bayesian (CVB) inference [TNW06].
In this paper we focus on collapsed Gibbs sampling.

2.2 Collapsed Gibbs sampling 

Collapsed Gibbs sampling is an MCMC method that works by iterating over each of
the latent topic variables $z_1, \ldots, z_n$, sampling each $z_i$ from $P(z_i \mid z_{\neg i})$. This is done
by integrating out the other latent variables ($\theta$ and $\beta$). We do not dwell on
the details here, since they are well explained in [GS04, Hei04], but in
essence we need to sample from this distribution:

$$p(z_i = k \mid z_{\neg i}, w) \propto \frac{n_{k,v,\neg i} + \eta}{\sum_{v=1}^{V} [n_{k,v,\neg i} + \eta]} (n_{m,k,\neg i} + \alpha) \qquad (1)$$

$$\propto \frac{n_{k,v,\neg i} + \eta}{n_{k,\neg i} + V\eta} (n_{m,k,\neg i} + \alpha) \qquad (2)$$

In simple terms, to sample the topic of a word of a document given all the other
words and topics we need, for each k in {1, . . . , K}:

1. $n_{k,v,\neg i}$: the total number of times the word's term has been observed with topic
k (excluding the word we are sampling for).

2. $n_{k,\neg i}$: the total number of times topic k has been observed in all documents
(excluding the word we are sampling for).

3. $n_{m,k,\neg i}$: the number of times topic k has been observed in a word of this
document (excluding the word we are sampling for).
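The three counts above are all that a single sampling step needs. The following is a minimal sketch of that step, assuming the counts are held in NumPy arrays `n_kv` (K x V), `n_k` (length K) and `n_mk` (M x K); the function name and array layout are illustrative choices, not taken from the paper.

```python
import numpy as np

def resample_token(m, v, k_old, n_kv, n_k, n_mk, alpha, eta, rng):
    """Resample the topic of one token (document m, term v, current topic k_old)."""
    V = n_kv.shape[1]
    # Exclude the current token from all three counts (the "not i" counts).
    n_kv[k_old, v] -= 1
    n_k[k_old] -= 1
    n_mk[m, k_old] -= 1
    # Unnormalized weight of every topic k, following equation (2).
    weights = (n_kv[:, v] + eta) / (n_k + V * eta) * (n_mk[m] + alpha)
    # Normalize and draw the new topic.
    k_new = int(rng.choice(len(weights), p=weights / weights.sum()))
    # Add the token back under its new topic.
    n_kv[k_new, v] += 1
    n_k[k_new] += 1
    n_mk[m, k_new] += 1
    return k_new
```

One Gibbs iteration calls this function once for every word of every document, which is where the per-iteration cost proportional to the number of words times K comes from.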
Table 1: Notation

variable        description
$D_{train}$     training document corpus
$D_{test}$      testing document corpus
K               number of topics
M               number of documents
$N_m$           number of words in document m
V               dictionary size
C               number of CPUs
$\alpha$        Dirichlet prior for $\theta$ (hyperparameter)
$\eta$          Dirichlet prior for $\beta$ (hyperparameter)
$\theta$        distribution of topics per document
$\beta$         distribution of words per topic
$z_{m,n}$       topic (1..K) of word n of document m
$w_{m,n}$       term index (1..V) of word n of document m
$n_{k,v}$       number of times the term v has been observed with topic k
$n^{l}_{k,v}$   local modifications to $n_{k,v}$
$n_k$           number of times topic k has been observed in all documents
$n_{m,k}$       number of times topic k has been observed in a word of document m
$n_m$           number of words in document m



3 Related work 

Several lines of research aim to increase the efficiency and/or scalability
of LDA. We discuss them next.

3.1 Faster sampling 

The usual approach to draw samples of z using (1) is to compute a normalization
constant $Z = \sum_{k=1}^{K} p(z_i = k \mid z_{\neg i}, w)$, summing the unnormalized values
from (1) over all topics, to obtain a probability distribution that can be sampled from:

$$p(z_i = k \mid z_{\neg i}, w) = \frac{1}{Z} \frac{n_{k,v,\neg i} + \eta}{n_{k,\neg i} + V\eta} (n_{m,k,\neg i} + \alpha) \qquad (3)$$

This leads to a complexity of $O(N_t K)$ for each iteration of standard Gibbs sampling,
where $N_t$ is the total number of words in the corpus and K is the number of topics.

[PNI+08] proposed a way to avoid computing (1) for each of the K topics by obtaining an
upper bound on Z using Hölder's inequality and computing (1) for the most probable topics
first, leading to a speed-up of up to 8x in the sampling process.

[YMM09] split (1) into three components and exploited the resulting sparsity
in k of some of them. Combined with an efficient storage scheme, this led to a
speed-up on the order of 20x.
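To illustrate how a three-way split can expose sparsity in k, one algebraic decomposition of the unnormalized weight in (2) is shown below. It is offered as an illustration only: this text does not spell out the components used in [YMM09], so the grouping here should be read as an assumption.

```latex
\frac{(n_{k,v,\neg i} + \eta)\,(n_{m,k,\neg i} + \alpha)}{n_{k,\neg i} + V\eta}
  = \frac{\alpha\,\eta}{n_{k,\neg i} + V\eta}
  + \frac{\eta\, n_{m,k,\neg i}}{n_{k,\neg i} + V\eta}
  + \frac{(\alpha + n_{m,k,\neg i})\, n_{k,v,\neg i}}{n_{k,\neg i} + V\eta}
```

The second term is nonzero only for topics that occur in document m, and the third only for topics in which term v has been observed, so for most k only the first term needs to be considered.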
3.2 Parallelism



A complementary approach for scalability is to share the processing among several
CPUs/cores, in the same computer (multi-core) or in different computers (clusters).

3.2.1 Fine grained parallelism 

In most CPU architectures the cost of creating threads/processes and synchronizing
data among them can be very significant, making it infeasible to share a task in a
fine-grained manner. One exception, however, is the Graphics Processing Unit (GPU).
Since GPUs were originally designed to parallelize jobs at the pixel level, they are well
suited for fine-grained parallelization tasks.

[MHSO09] proposed to use GPUs to parallelize the sampling at the topic level.
Although their work was with collapsed variational Bayesian (CVB) inference [TNW06],
it could probably be extended to collapsed Gibbs sampling. Note that
this kind of parallelization is complementary to the document-level one (see next
section), so both can be applied in conjunction.

3.2.2 Coarse grained parallelism 

Most of the work on parallelism has been at the document level: each CPU/core is
responsible for a set of documents.

Looking at equation (1), it can be seen that the right-hand factor contains a
document-specific variable ($n_{m,k}$). Only $n_{k,v}$ (and its sum, $n_k$), in the left-hand
factor, is shared among all documents. Using this fact, [NASW07] proposed to simply process
a subset of the documents in each CPU, synchronizing the global counts ($n_{k,v}$) at the end of
each step. This is an approximation, since we are no longer sampling from the true
distribution, but from a noisy version of it. They showed, however, that it works well
in practice. They also proposed a more principled way of sharing the task using a
hierarchical model and, even though that was more costly, the results were similar.
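A synchronization step of this kind can be sketched as follows, assuming every CPU starts the step from the same global counts and modifies only its own copy. The merge rule (old counts plus the sum of each CPU's change) and the function name are illustrative assumptions; this text does not state the exact rule used in [NASW07].

```python
import numpy as np

def merge_global_counts(n_kv_old, local_copies):
    """Combine per-CPU copies of n_{k,v} after one synchronized step.

    Each CPU started the step from n_kv_old and changed only its own copy,
    so the merged counts are the old counts plus every CPU's change.
    """
    merged = n_kv_old.copy()
    for n_kv_c in local_copies:
        merged += n_kv_c - n_kv_old
    return merged
```

Because each CPU only moves its own tokens between topics, the total of the merged counts equals the total of the old counts.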

[ASW08] proposed a similar idea, but with an asynchronous model, where there is 
no global synchronization step (as there is in [NASW07]). 

4 Our method 

We follow [ASW08] and use coarse-grained asynchronous parallelism, dividing
the task at the document level. For simplicity, we split the M documents among the C
CPUs equally, so that each CPU receives $\frac{M}{C}$ documents¹. We then proceed in the usual
manner, with each CPU running standard Gibbs sampling on its set of documents.
Each CPU, however, keeps a copy of all its modifications to $n_{k,v}$ and, at the end of
each iteration, stores them in a file in a shared filesystem. Right after that, it reads all
modifications stored by other CPUs and incorporates them into its $n_{k,v}$. This works in
an asynchronous manner, with each CPU saving its modifications and reading other
CPUs' modifications at the end of each iteration. The algorithm is detailed in Algorithm 1.

¹This is not strictly necessary: when working with a cluster of heterogeneous CPUs it might be
preferable to split proportionally to the processing power of each CPU.

Algorithm 1 Simple sharing

Input: $\alpha$, $\eta$, K, $D_{train}$, C, num_iter
Randomly initialize $z_{m,n}$, updating $n_{k,v}$ and $n^{l}_{k,v}$ accordingly.
Save $n^{l}_{k,v}$ to a file
for t = 1 to num_iter do
    Run collapsed Gibbs sampling, updating $z_{m,n}$, $n_{k,v}$ and $n^{l}_{k,v}$
    Save $n^{l}_{k,v}$ to a file
    Load modifications to $n_{k,v}$ from other CPUs
end for
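The save and load steps of Algorithm 1 might look like the sketch below, which stands in for a shared filesystem with an ordinary directory. The class name `Worker`, the file naming scheme, and the choice to store cumulative modifications (so that re-reading an unchanged file is harmless) are all assumptions made for illustration; the paper does not describe its file format, and atomic file replacement is not handled here.

```python
import os
import tempfile
import numpy as np

class Worker:
    """One CPU's view of the counts in a simple-sharing scheme (toy sketch)."""

    def __init__(self, cpu_id, n_cpus, K, V, shared_dir):
        self.cpu_id, self.shared_dir = cpu_id, shared_dir
        self.n_kv = np.zeros((K, V), dtype=np.int64)   # local view of the global counts
        self.local = np.zeros((K, V), dtype=np.int64)  # own modifications so far
        # Last cumulative modifications seen from each peer (assumption).
        self.seen = {c: np.zeros((K, V), dtype=np.int64)
                     for c in range(n_cpus) if c != cpu_id}

    def path(self, c):
        return os.path.join(self.shared_dir, f"mods_{c}.npy")

    def record(self, k, v, delta):
        """Apply a count change made by the local Gibbs sampler."""
        self.n_kv[k, v] += delta
        self.local[k, v] += delta

    def save(self):
        """Publish this CPU's modifications to the shared directory."""
        np.save(self.path(self.cpu_id), self.local)

    def load_others(self):
        """Fold in whatever the other CPUs have published so far."""
        for c, last in self.seen.items():
            if not os.path.exists(self.path(c)):
                continue  # this peer has not saved anything yet
            current = np.load(self.path(c))
            self.n_kv += current - last
            self.seen[c] = current

shared = tempfile.mkdtemp()
a = Worker(0, 2, K=2, V=3, shared_dir=shared)
b = Worker(1, 2, K=2, V=3, shared_dir=shared)
a.record(0, 1, +1)
b.record(1, 2, +1)
a.save(); b.save()
a.load_others(); b.load_others()
```

No worker waits for another: a CPU that runs ahead simply reads the most recent file each peer has written.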



We first note that, in this simple algorithm, the complexity of the sampling step
is $O(N_c K)$ (where $N_c$ is the number of words being processed in CPU c), while the
synchronization part takes $O(CKV)$ (we save a $K \times V$ matrix once and load it $C - 1$
times). Plugging in the following values, based on a standard large scale task:

• K = 500 topics

• C = 100 CPUs

• $N_c = 10^7$ words

• $V = 10^5$ terms

we get similar values for the sampling and the synchronization steps. That, however,
does not take into account the constants. In our experiments, with these parameters
a sampling step takes approximately 500 seconds, while the synchronization
takes around 20,000 seconds (assuming a 1 Gbit/s Ethernet connection shared among all
CPUs). The bottleneck is clearly in the synchronization step.
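A rough back-of-the-envelope check of these figures follows. It assumes $N_c = 10^7$ words and $V = 10^5$ terms (exponents for which the two operation counts coincide, as the text states), dense matrices of 32-bit counts, and a single link shared by all CPUs. The byte size and the traffic model are assumptions, not measurements from the experiments.

```python
K, C = 500, 100
N_c, V = 10**7, 10**5          # assumed exponents (see the paragraph above)
BYTES_PER_COUNT = 4            # assumption: 32-bit integer counts
LINK_BYTES_PER_S = 1e9 / 8     # 1 Gbit/s, shared among all CPUs

sampling_ops = N_c * K         # O(N_c K) per CPU
sync_ops = C * K * V           # O(C K V) per CPU

matrix_bytes = K * V * BYTES_PER_COUNT
# Each CPU writes its matrix once and reads the other C - 1 matrices,
# so C matrices cross the shared link per CPU and iteration.
total_traffic = C * C * matrix_bytes
sync_seconds = total_traffic / LINK_BYTES_PER_S

print(sampling_ops, sync_ops)  # both 5000000000
print(sync_seconds)            # 16000.0
```

Under these assumptions the transfer alone takes about 16,000 seconds per iteration, the same order of magnitude as the 20,000 seconds quoted above.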

We propose, therefore, a variation of the first algorithm. When saving the
modifications at the end of an iteration, only save those that are relevant. More formally,
save (in a sparse format) only those items of $n^{l}_{k,v}$ for which

$$\frac{|n^{l}_{k,v}|}{n_{k,v}} > threshold \qquad (4)$$

where threshold is a parameter that can range from 0 to 1. The algorithm is
detailed in Algorithm 2. Note that setting threshold to zero recovers Algorithm 1.
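The filtering step can be sketched as below. The relevance rule used here, keeping an entry when the magnitude of the local modification divided by the current count exceeds threshold, is an assumption about the form of (4), and the treatment of zero counts and the function name are illustrative choices.

```python
import numpy as np

def sparse_modifications(local, n_kv, threshold):
    """Return (k, v, value) arrays for the local modifications worth sharing.

    Assumed rule: keep an entry when |local[k, v]| / n_kv[k, v] > threshold.
    Where n_kv[k, v] == 0, any nonzero modification is kept.
    """
    magnitude = np.abs(local).astype(float)
    # Ratio is left at infinity wherever the count is zero.
    ratio = np.divide(magnitude, n_kv,
                      out=np.full(local.shape, np.inf), where=n_kv > 0)
    keep = (ratio > threshold) & (local != 0)
    ks, vs = np.nonzero(keep)
    return ks, vs, local[ks, vs]

local = np.array([[5, 0, -1], [1, 2, 0]])
n_kv = np.array([[10, 4, 100], [100, 4, 0]])
ks, vs, values = sparse_modifications(local, n_kv, threshold=0.3)
```

With threshold set to 0 every nonzero modification is kept, so the same information is shared as in Algorithm 1, only in a sparse format.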

Algorithm 2 Sparse sharing

Input: $\alpha$, $\eta$, K, $D_{train}$, C, num_iter
Randomly initialize $z_{m,n}$, updating $n_{k,v}$ and $n^{l}_{k,v}$ accordingly.
Save $n^{l}_{k,v}$ to a file
for t = 1 to num_iter do
    Run collapsed Gibbs sampling, updating $z_{m,n}$, $n_{k,v}$ and $n^{l}_{k,v}$
    for k = 1 to K do
        for v = 1 to V do
            Save $n^{l}_{k,v}$ if $|n^{l}_{k,v}| / n_{k,v} > threshold$
        end for
    end for
    Load modifications to $n_{k,v}$ from other CPUs
end for

5 Experiments

5.1 Datasets

We ran our experiments on three datasets: NIPS full papers (books.nips.cc), Enron
emails (www.cs.cmu.edu/~enron) and KOS (dailykos.com)². Each dataset was split
into 90% for training and 10% for testing. Details on the parameters of the datasets are
shown in Table 2.

²We used the preprocessed datasets available at http://archive.ics.uci.edu/ml/datasets/Bag+of+Words.

Table 2: Parameters of the three datasets used.

                                        NIPS        Enron       KOS
number of documents in $D_{train}$      1,350       35,874      3,087
number of documents in $D_{test}$       150         3,987       343
total number of words                   1,932,364   6,412,171   467,713
vocabulary size V                       12,419      28,102      6,906


All experiments were run on a cluster of 11 machines, each one with a dual-core
AMD64 2.4 GHz CPU and 8 GB of RAM (22 CPUs total). All machines share a
network file system over a 1 Gb Ethernet network.

We used a fixed set of LDA parameters: K = 50 (unless otherwise noted),
$\alpha = 0.1$, $\eta = 0.01$ and 1500 iterations of the Gibbs sampler. To compare the quality
of different approximations we computed the perplexity of a held-out test set.
Perplexity is commonly used in language modeling: it is equivalent to the inverse of the
geometric mean per-word likelihood. Formally, given a test set of $M_{test}$ documents:

$$perplexity(D_{test}) = \exp\left\{ -\frac{\sum_{m=1}^{M_{test}} \log p(w_m)}{\sum_{m=1}^{M_{test}} N_m} \right\} \qquad (5)$$
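As a small worked example, perplexity can be computed from per-document log-likelihoods, assuming the usual definition: the exponential of minus the total log-likelihood divided by the total number of test words. How $\log p(w_m)$ is estimated under the trained model is not specified here, so the function below takes those values as given; its name and signature are illustrative.

```python
import math

def perplexity(doc_log_likelihoods, doc_lengths):
    """exp(-(sum of log p(w_m)) / (total number of test words))."""
    total_words = sum(doc_lengths)
    if total_words == 0:
        raise ValueError("test set contains no words")
    return math.exp(-sum(doc_log_likelihoods) / total_words)

# A model that assigns probability 1/10 to every word has perplexity 10
# (up to floating-point rounding).
log_liks = [3 * math.log(0.1), 2 * math.log(0.1)]
print(perplexity(log_liks, [3, 2]))
```

Lower values are better: a perplexity of 10 means the model is, on average, as uncertain as a uniform choice among 10 words.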

5.2 Results 

In figure 2 we compare running time and perplexity for different values of threshold
and different numbers of CPUs. We can see that as we increase threshold we can
significantly reduce training time, with just a small impact on the quality of the
approximation, measured by the perplexity computed on a held-out test set. We can also
see that, as expected, the training time reduction becomes more significant as we
increase the amount of information that has to be shared, by adding more CPUs to the
task.

In figure 3 we show the proportion of time spent in synchronization at each iter- 
ation when training the LDA model with different numbers of CPUs. By increasing 
threshold we can substantially decrease synchronization time. As expected, as the
number of CPUs increases, synchronization starts to dominate over processing time.

In figure 4 we show the amount of information saved at each step for different 
values of threshold. We see that in the first few iterations the savings obtained by 
Algorithm 2 are small, since almost all modifications are relevant, but as the model 
converges the amount of relevant information stabilizes at a lower level. We can also 
see that as we add more CPUs the savings become more prominent - this is expected,
since the modifications of a single CPU tend to be less relevant as it becomes
responsible for a smaller proportion of the corpus.

In figure 5 we plot the speed-up obtained for different numbers of CPUs with
different values of threshold. We see that the simple sharing method (Algorithm 1), which
corresponds to threshold = 0, fails to get a significant improvement, except for small
clusters of 4 CPUs. With sparse sharing (threshold > 0), however, we can get
speed-ups of more than 7x for 8 CPUs, and more than 12x for 16 CPUs. This can also be
seen in figure 6, where we plot the speed-up for different numbers of CPUs for both
algorithms.

We would like to note that the datasets used are relatively small, as is the number
of topics (K = 50), leading to tasks that are not well suited for parallelization with
a large number of CPUs. The purpose of these experiments was simply to measure
the effects of the approximation proposed in Algorithm 2 - for greater speed-ups when
working with hundreds of CPUs a larger dataset or number of topics would be required.
As an example we ran experiments with K = 500 and, as can be seen in figure 7, we
can get speed-ups closer to the theoretical limit.

To get some perspective on the significance of the approximations being used, in
figure 8 we compare our results to a variational Bayes inference implementation. We
used the code from [ ]³, with its default parameters and $\alpha$ fixed to 0.1, as in the
Gibbs experiments. As can be seen, not only is the Gibbs sampler substantially faster,
but its perplexity results are also better, even with all the approximations.

6 Conclusion and Discussion 

We proposed a simple method to reduce the amount of time spent in synchronization in
a distributed implementation of LDA. We presented empirical results showing a
reasonable speed-up improvement, at the cost of a small reduction in the quality of the learned
model. The method is tunable, allowing a trade-off between speed and accuracy, and is
completely asynchronous. Source code is available at the first author's web page.⁴

As future work we plan to look for more efficient ways of sharing information
among CPUs, while also applying the method to larger datasets, where we expect to
see more significant speed-up improvements.

³http://www.cs.princeton.edu/~blei/lda-c/index.html
⁴http://users.rsise.anu.edu.au/~jpetterson/

Figure 2: Normalized running time and test set perplexity as a function of the
threshold parameter. From left to right: 4, 8 and 16 CPUs. From top to bottom:
NIPS, ENRON and KOS datasets.

Figure 3: Proportion of iteration time spent in synchronization for different values
of threshold (this plot was smoothed with a moving average filter with a span of 200 
iterations). From left to right: 4, 8 and 16 CPUs. From top to bottom: NIPS, ENRON 
and KOS datasets.

Figure 4: Proportion of local modifications to $n_{k,v}$ saved at each iteration, for different
values of threshold. From left to right: 4, 8 and 16 CPUs. From top to bottom: NIPS,
ENRON and KOS datasets.

Figure 5: Speed-up compared to a one-core implementation for different values of
threshold. From left to right: 4, 8 and 16 CPUs.



Figure 6: Speed-up for different numbers of CPUs (K = 50). Left: Algorithm 1
(threshold = 0). Right: Algorithm 2 (threshold = 0.5).

Figure 7: Speed-up for different numbers of CPUs (K = 500). Left: Algorithm 1
(threshold = 0). Right: Algorithm 2 (threshold = 0.5).